ABSTRACT

Music structure refers to the description of the long-term
organization of a music piece through a sequence of structural
segments. A structural segment can be defined by its
structural borders (a start time and an end time) and a label
reflecting the similarity of its musical content to that of
the other segments. Its duration is typically around 16 s
or more.

This document presents the music structure estimation
system submitted to MIREX's structural segmentation task
in 2012. It is composed of three steps: feature extraction,
structural border estimation and segment labeling. First,
the system produces a sequence of chroma vectors [6]
expressed at the snap scale [1] (section 1). This sequence
is used to calculate a segmentation criterion based on a
morphological model of the structural segments [2]
(sections 2.1 and 2.2). The structural border estimation is
performed by searching for the segmentation with the lowest
cost, which combines this criterion and a regularity
constraint (sections 2.3 and 2.4). The segments are then
labeled by clustering according to their similarity, through
the minimization of an adaptive model selection criterion
(section 3).

1. FEATURE EXTRACTION 

The sequence of 12-dimensional chroma vectors used to
describe the music content of the piece is extracted with
the "Chroma Toolbox" by Müller and Ewert [6]. We use the
CP features, computed at regular intervals with a hop of
0.1 s.

These features are then expressed at the snap scale. The
snap is defined here as the multiple of the beat whose period
is closest to 1 s. The snap scale is synchronous with the
downbeat scale, and the two are often equal in practice.
Beat and downbeat estimation is performed with the MATLAB
implementation by Davies et al. [4,5]. The downbeat estimator
is tuned to assume 4 beats per bar.

To each snap we associate the mean of the CP features
contained in the window that is centered on the snap and
lasts one snap period.
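The averaging step can be illustrated with a minimal Python sketch. It is not the submitted MATLAB implementation: the function name, the list-of-lists layout and the handling of the last snap and of empty windows are assumptions made for the example.

```python
def snap_synchronous_chroma(frames, hop, snap_times):
    """Average chroma frames over a window centred on each snap.

    frames: list of chroma vectors, frame k being located at time k * hop.
    hop: frame period in seconds (0.1 s in the text).
    snap_times: increasing snap positions in seconds (at least two snaps).
    """
    out = []
    for i, t in enumerate(snap_times):
        # Assumption: the snap period is the distance to the next snap
        # (to the previous one for the last snap).
        if i + 1 < len(snap_times):
            period = snap_times[i + 1] - t
        else:
            period = t - snap_times[i - 1]
        lo, hi = t - period / 2.0, t + period / 2.0
        window = [f for k, f in enumerate(frames) if lo <= k * hop < hi]
        if not window:
            # Assumption: fall back to the nearest frame if the window is empty.
            k = min(max(int(round(t / hop)), 0), len(frames) - 1)
            window = [frames[k]]
        dim = len(window[0])
        out.append([sum(f[d] for f in window) / len(window) for d in range(dim)])
    return out
```

Each output vector keeps the dimension of the input frames, so the result is one 12-dimensional chroma vector per snap.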

2. STRUCTURAL BORDER ESTIMATION 

2.1 Morphological model 

We assume that a structural segment can be characterized
by its inner organization, according to its musical layers
(timbre, harmony, melody, ...). In this scope we consider
the system and contrast model by Bimbot et al. [2].
It considers that each targeted structural segment is built
from a group of typically four morphological elements
of four snaps each, noted $\{a_1, a_2, a_3, a_4\}$. The first
three elements are related by simple transformations $f$ and
$g$ such that $a_2 = f(a_1)$ and $a_3 = g(a_1)$. The fourth
element can either follow the logic of the first three elements
and then form a system ($a_4 = f(g(a_1))$), or on the contrary
contrast with it ($a_4 = \delta(f(g(a_1)))$ where
$\delta \neq id$). Note that we assume that the relevant layers
for structural analysis can vary from one structural segment
to another.

However, in many cases, either $f = id$ or $g = id$,
or both. This leads to the usual morphological patterns
aaaa, abab, aabb in the case of systems with no
contrast, or aaab, abac, aabc in the case of systems
ending with a contrast. These patterns can be extended to the
case where the identity function $id$ is replaced by "close
to identity" functions $id'$ (aaa'b, aba'c, aa'bc, ...). More
information on this model can be found in [3].
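To make the link between the transformations and the letter patterns concrete, here is a small symbolic sketch in Python. It only manipulates labels, not audio features, and the function name and the tuple encoding are hypothetical choices made for the illustration.

```python
def morphological_pattern(f_is_id, g_is_id, contrast):
    """Return the 4-letter pattern of a segment a1 a2 a3 a4.

    a2 = f(a1), a3 = g(a1) and a4 = f(g(a1)), or a contrasting element
    when `contrast` is True. Elements are compared symbolically.
    """
    a1 = ("a1",)

    def f(x):
        return x if f_is_id else ("f",) + x

    def g(x):
        return x if g_is_id else ("g",) + x

    a2, a3, a4 = f(a1), g(a1), f(g(a1))
    if contrast:
        # The contrast function differs from the identity, so a4 departs
        # from the element expected by the system.
        a4 = ("delta",) + a4

    letters, pattern = {}, ""
    for element in (a1, a2, a3, a4):
        if element not in letters:
            letters[element] = "abcd"[len(letters)]
        pattern += letters[element]
    return pattern
```

Calling the function with both transformations set to the identity and no contrast gives the pattern aaaa, and switching the flags reproduces the other patterns listed above.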

2.2 Segmentation criterion 

The aim is to evaluate, for each time unit considered, the
likelihood that it corresponds to the beginning of a system.
We assume that at least one of the relations ($f$ and/or
$g$) between the elements of a system equals the identity
function. For each snap $t \in [1, T]$ of a music piece, we
consider an analysis window of size $N = 16$ snaps that
covers three morphological elements starting from
$t$ ($a_1$, $a_2$, $a_3$) and one morphological element before
this snap ($a_0$), as represented in figure 1. The size of each
morphological element is $N_e = 4$ snaps. The criterion $\Phi$
considered in this work is a linear combination of two
quantities:


$$\Phi(t) = \lambda_1 \, \sigma_{System}(t) + \lambda_2 \, \sigma_{Contrast}(t) \qquad (1)$$

with $\lambda_1$ and $\lambda_2$ learnt on a training database ¹.

Figure 1. Analysis window used for segmentation criterion
calculation, containing the sequence of features $x$. It
is composed of 4 small windows, each one related to the
position of a morphological element $a_i$ of size $N_e = 4$
snaps, $i \in \{0, 1, 2, 3\}$.

$\sigma_{System}(t)$ quantifies the likelihood that $t$
corresponds to the start time of a system through the analysis
of the similarity of the morphological elements $a_2$ and $a_3$
with regard to $a_1$. Let $x = \{x_k\}_{1 \le k \le N}$ be the
sequence of features contained in the analysis window related
to snap $t$. Let us define $X = \{X_0, X_1, X_2, X_3\}$, with
vector $X_i = \{x_{1+iN_e}, \ldots, x_{N_e+iN_e}\}$ for
$i \in \{0, 1, 2, 3\}$, and
$Y = [Y_1, Y_2] = [(X_2 - X_1)^2, (X_3 - X_1)^2]$. $X_i$ contains
the sequence of features of $a_i$, and $Y_j$ is the squared
distance between the features of $a_{j+1}$ and $a_1$, for each
of its dimensions. We have:


$$\sigma_{System}(t) = \frac{\sum_{j} \min\left(Y_1(j),\, Y_2(j)\right)}{\ldots} \qquad (2)$$


where $\|X_i\|$ corresponds to the $\ell_2$ norm of vector $X_i$.
$\sigma_{System}(t)$ evaluates the contribution of $X_1$ to
the explanation of either $X_2$ or $X_3$ according to the various
dimensions of the features. A high contribution implies that a
system is likely to begin at snap $t$ in the piece.

In the scope of the system and contrast model, a structural
segment is likely to begin at $t$ if $a_2$ and/or $a_3$ is
similar to $a_1$. This is taken into account through
$\sigma_{System}(t)$. If the preceding structural segment is
different from the current one, the third and the fourth
morphological elements of that preceding segment differ
from $a_1$. If the two segments are the same, then the third
element may be similar to $a_1$, but the fourth one generally
differs from it. We therefore introduce $\sigma_{Contrast}(t)$,
which evaluates the dissimilarity between the morphological
element preceding snap $t$, noted $a_0$, and $a_1$. We choose to
formulate it as follows:


$$\sigma_{Contrast}(t) = \mathrm{cotan}(X_0, X_1) \qquad (3)$$

where $\mathrm{cotan}(X_i, X_j)$ corresponds to the cotangent of the
angle between vectors $X_i$ and $X_j$.
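As a rough illustration of these two quantities, the Python sketch below splits an analysis window into the four vectors $X_0, \ldots, X_3$, computes the cotangent used in equation (3), and computes a sum over the feature dimensions of the smaller of the two squared differences. The use of a minimum is an assumption, and the normalization of equation (2) is deliberately left out. Function names and the list-of-lists layout are hypothetical.

```python
import math


def split_window(x, n_e=4):
    """Split a window of 4 * n_e snaps into X_0..X_3, each flattened to one vector."""
    assert len(x) == 4 * n_e
    return [[v for snap in x[i * n_e:(i + 1) * n_e] for v in snap]
            for i in range(4)]


def cotangent(u, v):
    """Cotangent of the angle between two non-zero vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    sin = math.sqrt(1.0 - cos * cos)
    return math.inf if sin == 0.0 else cos / sin


def system_numerator(X):
    """Sum over dimensions of min(Y_1(j), Y_2(j)), with Y_k = (X_{k+1} - X_1)^2.

    Assumption: unnormalized quantity, not the full sigma_System.
    """
    y1 = [(b - a) ** 2 for a, b in zip(X[1], X[2])]
    y2 = [(c - a) ** 2 for a, c in zip(X[1], X[3])]
    return sum(min(p, q) for p, q in zip(y1, y2))


def sigma_contrast(X):
    """Cotangent of the angle between X_0 and X_1, as in equation (3)."""
    return cotangent(X[0], X[1])
```

With this formulation, a small cotangent means that $X_0$ and $X_1$ point in different directions, that is, the element before the snap contrasts with the element starting at the snap.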

2.3 Regularity constraint 

We consider a regularity constraint in the structural border
estimation to favor segmentations whose segments are close to a
typical segment size, or structural pulse, $\tau = 16$ snaps. Let
$m$ be the size of a structural segment:

$$\Psi_\alpha(m) = \left| \frac{m}{\tau} - 1 \right|^{\alpha} \qquad (4)$$

with $\alpha$ a factor which controls the convexity of this
function. The use of a non-convex function favors segmentations
with a majority of segments of size equal to $\tau$ and few
segments whose size is far from this structural pulse.
Conversely, a convex function loosens this constraint.

¹ The RWC Popular database with structural annotations from [1]
was used for parameter tuning.
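The effect of the exponent can be checked numerically with a short Python sketch of the penalty $|m/\tau - 1|^{\alpha}$. The function name and the default value of `alpha` are hypothetical; the text does not give the value used in the submission.

```python
def regularity_penalty(m, tau=16, alpha=0.5):
    """Penalty |m / tau - 1| ** alpha for a segment of m snaps.

    tau is the structural pulse (16 snaps in the text); the default alpha
    is an arbitrary value chosen for the illustration.
    """
    return abs(m / tau - 1.0) ** alpha


def total_penalty(sizes, tau=16, alpha=0.5):
    """Sum of the penalties of all segments of a segmentation."""
    return sum(regularity_penalty(m, tau, alpha) for m in sizes)
```

Comparing two segmentations of 72 snaps, `[16, 16, 16, 24]` and `[18, 18, 18, 18]`, an exponent of 0.5 (non-convex) gives the lower total penalty to the first one, whereas an exponent of 2 (convex) gives the lower total penalty to the second one.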

2.4 Performing the structural border estimation 

The segmentation criterion and the regularity constraint are
combined linearly to form a segmentation cost $C$:

$$C = (1 - \lambda_3) \, \Phi + \lambda_3 \, \Psi_\alpha \qquad (5)$$

with $\lambda_3 \in [0, 1]$.

We use the Viterbi algorithm described in [8] to find the
segmentation with the lowest segmentation cost.
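The search for the lowest-cost segmentation can be sketched as a generic dynamic program in Python. This is not the algorithm of [8]. It assumes that the total cost is the sum, over segments, of the criterion at the segment start and of the regularity penalty on the segment size. The function name, the default parameter values and the `max_len` bound are hypothetical.

```python
def best_segmentation(phi, tau=16, alpha=0.5, lam3=0.5, max_len=64):
    """Return the segment start indices that minimise the total cost.

    phi[t] is the segmentation criterion at snap t (low means likely border).
    Assumption: each segment [start, end) costs
    (1 - lam3) * phi[start] + lam3 * |(end - start) / tau - 1| ** alpha.
    """
    n = len(phi)
    best = [float("inf")] * (n + 1)  # best[e]: lowest cost for snaps [0, e)
    prev = [0] * (n + 1)             # prev[e]: start of the last segment
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            length = end - start
            penalty = abs(length / tau - 1.0) ** alpha
            cost = best[start] + (1.0 - lam3) * phi[start] + lam3 * penalty
            if cost < best[end]:
                best[end] = cost
                prev[end] = start
    # Backtrack from the end of the piece to recover the borders.
    borders = []
    end = n
    while end > 0:
        borders.append(prev[end])
        end = prev[end]
    return borders[::-1]
```

The double loop examines every admissible last segment for every end position, so the run time grows with the number of snaps times `max_len`.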

3. SEGMENT LABELING 

The labeling of the obtained segments is performed by the
method described in [9]. Here, we transform the chroma
sequence into a symbolic sequence by means of vector
quantization, with the number of chroma clusters empirically
fixed to 16. The Edit Distance on the symbolic features
used to compare the content of the segments is replaced by
a stripe distance [7] on the corresponding numeric chroma
features. As we consider that the content of the fourth
morphological element can be very variable, as explained in
section 2.1, we only consider the first three quarters of a
segment when it lasts 16 snaps or more.
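The truncation rule at the end of this section can be written as a short Python helper. It is a sketch under assumptions: the function name is hypothetical, and rounding the three-quarter length down to an integer number of snaps is a choice made for the example.

```python
def labeling_excerpt(segment, min_len=16):
    """Keep the first three quarters of a segment of at least min_len snaps.

    segment: list of per-snap feature vectors (or symbols).
    Shorter segments are returned unchanged.
    """
    if len(segment) >= min_len:
        # Assumption: the three-quarter length is rounded down.
        return segment[: (3 * len(segment)) // 4]
    return segment
```

A segment of 16 snaps is thus compared through its first 12 snaps, which leaves out the fourth morphological element.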